Introduction

Businesses that thrive are relentless with creative, new ideas to expand opportunities for revenue. Here at Regork, we pride ourselves on continual and strategical improvements to not only bring in customers but to also retain them. Our current conversations regarding customer spending revolves around groups of customers that are frequent visitors of the store. Further, how can we increase revenue to frequent visitors but low spenders?

In order to find the answer to this question, we needed to identify the demographic group(s) that satisfied our requirement. We decided to target demographic age groups as they typically share many of the same characteristics and buying habits, so it would be more logical to run a targeted promotion at a specified demographic. Finding the age group that fit our criteria of frequenters but low spenders was done by monitoring total transactions and average basket sales value across the given age ranges. From there, a deep dive of specific products will be necessary to uncover relations between buying habits.

The best solution we uncovered would be to install a coupon/promotional campaign to 19-24 year olds that are already frequenting Regork. This is a very interesting population because they are among the highest frequenters of Regork; however, they are typically the age group spending the least amount of money. We are confident that this promotion will drive sales in the short-term and retain consumers for the long-term.

Prerequisites

Libraries

We will be analyzing data from the completejourney R library. The completejourney package contains grocery store shopping transaction information from 2,469 households from 2016-2017.

A variety of libraries will be used to assist in the formulation of our analysis.

library(completejourney)      # data set
library(dplyr)                # manipulating and transforming data  
library(forcats)              # working with factors & categorical variables
library(lubridate)            # functions for datetime data
library(ggplot2)              # data visualization
library(plotly)               # more customization of visualizations

Data

To access the full set of transaction data, we must create a dataframe referencing the full transaction table found in the completejourney GitHub repository.

transactions <- get_transactions()     #transaction data recorded at the product level

In addition to this table, we will also be using the following to assist in our analysis:

demographics           #household demographic data (age, income, etc.)
products               #product information (category, brand, etc.)
coupons                #all available coupons
coupon_redemptions     #coupons that have been redeemed

Exploratory Data Analysis (EDA)

To start our analysis, we first want to examine the total number of transactions. To best tell the story, we grouped these results by age range. The goal of this is to determine the age range that has room to grow and expand their purchasing habits.

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(age) %>%
  summarize(total_transactions = n()) %>%
  ggplot(aes(age, total_transactions)) +
    geom_col(fill = "dodgerblue") +
    scale_y_continuous(label = scales::comma) +
  labs(
    title = "Total Transactions by Age Group",
    x = "Age Group",
    y = "Total Transactions",
    caption = "Data: completejourney"
  ) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank())

As seen above, the 45-54 age group has the most total transactions in the data set. This is not terribly surprising as this demographic typically go to the grocery for their entire family. For the same reasons, it is not surprising to see the 35-44 and 25-34 follow. This means that the 19-24, 55-64, and 65+ age groups have the most room for growth.

This data only means so much if revenue generated by transactions is correlated with the total number of transactions. For example, a group could have plenty of transactions, but they could be buying dollar candy bars or something of the like, while another group is buying high ticket items at low transaction rates. It is imperative to see the revenue each age group generates.

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(age) %>%
  summarize(total_revenue = sum(sales_value)) %>%
  ggplot(aes(age, total_revenue)) +
    geom_col(fill = "dodgerblue") +
    scale_y_continuous(label = scales::dollar) +
    labs(
      title = "Total Revenue Generated by Age Group",
      x = "Age Group",
      y = "Total Revenue",
      caption = "Data: completejourney"
      ) +
    theme_minimal() +
    theme(panel.grid.major.x = element_blank())

According to these graphs of transactions and revenue, the age groups that present the largest opportunity for growth on revenue are 19-24, 55-64, and 65+. However, it is imperative that we look at why this trend may exist in our data. Specifically, is this discrepancy in total transactions and revenue explained simply by our sample size?

#Number of households for each age group
demographics %>%
  group_by(age) %>%
  summarise(count = n())
## # A tibble: 6 × 2
##   age   count
##   <ord> <int>
## 1 19-24    46
## 2 25-34   142
## 3 35-44   194
## 4 45-54   288
## 5 55-64    59
## 6 65+      72

It appears that our limited sample size may explain the massive difference seen in transactions and revenue across the age groups. Therefore, to continue our analysis, it is imperative that we begin looking at measures that reduce the impact our sample size has on the data. In simple terms, we decided to pivot away from the use of cumulative measures for analysis. A more fitting measure to assess the spending habits of age groups is to look at the average baskets sales value: how much each age group is spending on average for each trip to Regork.

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(basket_id, age) %>%
  summarize(total_basket_sales = sum(sales_value)) %>%
  group_by(age) %>%
  summarize(avg_sales_value = mean(total_basket_sales, na.rm = TRUE)) %>% 
  ggplot(aes(x = avg_sales_value, y = age)) +
  geom_segment(aes(xend = 25, yend = age), color = "gray") +
  geom_dotplot(binaxis = "y", stackdir = "center", 
               dotsize = 2, color = "dodgerblue", fill = "dodgerblue") +
  scale_x_continuous( 
                     label = scales::dollar) +
  labs(
    title = "Average Basket Sales Value by Age Group",
    x = "Average Basket Sales Value",
    y = "Age Group",
    caption = "Data: completejourney"
  ) +
  theme_minimal() +
  theme(panel.grid.major.y = element_blank())

This graph shows us that the 19-24 ($26.00/trip) and 65+ ($26.70/trip) age groups are the lowest spenders when entering the store. This would confirm that these age groups have the most room for growth in our analysis of expanding spending. We are more confident in this data than the cumulative data examined prior. We decided to also analyze the average spend for each month as it will be easier to compare to the average amount of trips each age group makes to Regork each month. Additionally, this provides the benefit of being able to analyze at a more granular scale, removing potential outliers that may have skewed the overall average data.

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  mutate(month = month(transaction_timestamp)) %>%
  group_by(household_id, month, age) %>%
  summarize(total_spend = sum(sales_value, na.rm = TRUE)) %>%
  group_by(month,age) %>%
  summarize(avg_spend = mean(total_spend, na.rm = TRUE)) %>%
  ggplot(aes(x = factor(month, labels = month.abb), group = 1)) +
    geom_line(aes(y = avg_spend), color = "dodgerblue") +
    geom_point(aes(y = avg_spend), color = "blue3", size = 1) +
    labs(
      title = "Average Spend per Month by Age Group",
      x = "Month",
      y = "Average Spend",
      caption = "Data: completejourney"
    ) +
    scale_y_continuous(labels = scales::dollar) +
    facet_wrap(~age) +
    theme_bw()

As the graph shows, there is a bit of variability with how much each age range spends in any given month of the year. To get back to the original question being uncovered, we decided this data would best be used if we showed the average number of trips to Regork for each month. This will allow us to see if we can capitalize on certain periods of the year where certain age groups are in the store, but not spending as much as we would like.

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  mutate(month = month(transaction_timestamp)) %>%
  group_by(household_id, month, age) %>%
  summarize(total_trips = n_distinct(basket_id)) %>%
  group_by(month, age) %>%
  summarize(avg_trips = mean(total_trips, na.rm = TRUE)) %>% 
  ggplot(aes(x = factor(month, labels = month.abb), group = 1)) +
    geom_line(aes(y = avg_trips), color = "dodgerblue") +
    geom_point(aes(y = avg_trips), color = "blue3", size = 1) +
    labs(
      title = "Average Trips by Month and Age Group",
      x = "Month",
      y = "Average Trips",
      caption = "Data: completejourney"
    ) +
  facet_wrap(~age) +
  theme_bw()

After seeing these charts, we believe there is room to improve spending for the 19-24 age range. Even though the lowest average basket sales value, we believe we can capitalize on them visiting the store up to 11 times per month. Next, we want to see the income range for each age group.

demographics %>%
  group_by(age, income) %>%
  summarise(count = n()) %>%
  ungroup() %>%
  ggplot(aes(x = income, y = count)) +
  geom_bar(stat = "identity", fill = "dodgerblue") +
  labs(
    title = "Income Distribution by Age Group",
    x = "Income Range",
    y = "Number of Households"
  ) +
  facet_wrap(~age) +
  theme_bw() +
  theme(panel.grid.major.y = element_blank()) +
  coord_flip()

Although 19-24 year olds do not have the highest income (as expected), we believe that we can increase spending on cheap items, and drive sales from having them buy other items related to what they usually buy the most. To uncover this, we examined a tree diagram of the top 10 products that 19-24 year olds buy.

transactions %>%
  inner_join(products, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  filter(age == "19-24") %>%
  group_by(age, product_category) %>%
  summarise(count = n()) %>%
  slice_max(order_by = count, n = 10) %>%
  plot_ly(
    labels = ~product_category,
    parents = ~age,
    values = ~count,
    type = "treemap",
    text = ~paste("Transactions: ", count),
    marker = list(colors = rainbow(length(unique(.$age))))
  ) %>%
  layout(title = "Top 10 Product Categories by Total Transactions (19-24)")

As seen in the treemap, soft drinks are comfortably the most common product bought by this age group. If we want to run some sort of campaign, we want to include soft drinks and one of the other categories that is frequently bought by this age group. To make it convenient for the shopper, we want to examine the product categories most commonly purchased with soft drinks. This should drive customers to buy more of this other item, and with that, we hope to retain their loyalty and have them increase spending on other products.

#first find baskets including soft drinks
soft_drink_baskets <- transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(product_category == "SOFT DRINKS") %>%
  inner_join(demographics, by = "household_id") %>%
  filter(age == "19-24") %>%
  select(basket_id) %>%
  distinct() %>%
  pull()

#filter transactions for soft drink baskets 
prod_bought_with_soft_drink <- transactions %>%
  filter(basket_id %in% soft_drink_baskets) %>%
  inner_join(products, by = "product_id") %>%
  filter(product_category != "SOFT DRINKS") 

#graph products 
prod_bought_with_soft_drink %>%
  group_by(product_category) %>%
  summarize(count = n()) %>%
  slice_max(order_by = count, n = 5) %>%
  ggplot(aes(x = reorder(product_category, count), y = count)) +
    geom_col(fill = "dodgerblue") +
    coord_flip() + 
    geom_text(aes(label = count), 
              hjust = 1.2, 
              vjust = 0.6, 
              size = 6, 
              fontface = "bold",
              color = "white") +
    theme_minimal() +  
    theme(axis.text.x = element_blank(),  
          axis.ticks.x = element_blank(),
          panel.grid.major.x = element_blank()) +
    labs(title = "Top 5 Product Categories Bought with Soft Drinks",
         subtitle = "Only transactions made by 19-24 year-olds",
         y = "Number of Transactions",
         x = NULL,
         caption = "Data: completejourney"
        )

The top three product categories above are all pretty close to each other. We decided that the product here that makes the most sense to sell with soft drinks would be bagged snacks. This could easily fit the narrative for 19-24 year olds that are in college/freshly removed from college. The next part of the analysis is determining the usage of coupons by 19-24 year olds. Specifically, we must assess whether or not a campaign featuring coupons would help us reach our goal of increasing sales.

#determine the number of households that have redeemed coupons
coupon_hshds <- coupon_redemptions %>%
  inner_join(demographics) %>%
  filter(age == "19-24") %>%
  summarize(count_coupon_hshd = n_distinct(household_id)) %>%
  pull()

#number of households
hshd <- demographics %>%
  filter(age == "19-24") %>%
  group_by(age) %>%
  summarise(hshd = n_distinct(household_id)) %>%
  pull()

#percent of 19-24 year old households that have used coupons in the past year
percentage <- round((coupon_hshds/hshd)*100, digits = 2)
percentage
## [1] 19.57
#create dataframe for pie chart
pie_data <- data.frame(
  Category = c("Used Coupons", "Did Not Use Coupons"),
  Percentage = c(percentage, 100 - percentage)
)

#create pie chart
ggplot(pie_data, aes(x = "", y = Percentage, fill = Category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Household Coupon Usage for 19-24 Year-Olds",
       fill = "Coupon Usage",
       caption = "Data: completejourney") +
  geom_text(data = subset(pie_data, Category == "Used Coupons"), 
            aes(label = paste0(Percentage, "%")), 
                vjust = -3, 
                hjust = 0.6, 
                color = "white",
                fontface = "bold",
                size = 6) +
  theme_void() +
  scale_fill_manual(values = c("Used Coupons" = "dodgerblue", 
                               "Did Not Use Coupons" = "grey"))

Now that we know about how many 19-24 year olds are using coupons, it is important to assess the effectiveness of coupon usage on increasing revenue.

#average basket sales value when coupons are used
avg_basket_sales_with_coupons <- transactions %>%
  inner_join(demographics, by = "household_id") %>%
  filter(age == "19-24") %>%
  mutate(coupon_discount = coupon_disc + coupon_match_disc) %>%
  group_by(basket_id, age) %>%
  summarize(total_basket_sales = sum(sales_value, na.rm = TRUE),
            total_basket_discount = sum(coupon_discount, na.rm = TRUE)) %>%
  filter(total_basket_discount > 0) %>%
  group_by(age) %>%
  summarize(avg_sales_value = mean(total_basket_sales, na.rm = TRUE))

#average basket sales value
avg_basket_sales_without_coupons <- transactions %>%
  inner_join(demographics, by = "household_id") %>%
  filter(age == "19-24") %>%
  group_by(basket_id, age) %>%
  summarize(total_basket_sales = sum(sales_value, na.rm = TRUE)) %>%
  group_by(age) %>%
  summarize(avg_sales_value = mean(total_basket_sales, na.rm = TRUE))

#create a dataframe for plotting
coupon_spending_data <- data.frame(
  category = c("With Coupons", "Without Coupons"),
  avg_basket_sales = c(avg_basket_sales_with_coupons$avg_sales_value, 
                   avg_basket_sales_without_coupons$avg_sales_value),
  age = "19-24"
)

# plot average basket sales value for baskets that used coupons and baskets that didnt
ggplot(coupon_spending_data, aes(x = category, y = avg_basket_sales, fill = category)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  labs(title = "Average Basket Sales Value by Coupon Usage for 19-24 Year-Olds",
       y = "Basket Sales Value",
       x = "Coupon Usage",
       caption = "Data: completejourney") +
  scale_y_continuous(labels = scales::dollar) +
  theme_minimal() +
  scale_fill_manual(values = c("With Coupons" = "dodgerblue", 
                               "Without Coupons" = "grey")) +
  theme(panel.grid.major.x = element_blank()) +
  guides(fill = FALSE)

It appears that the average basket sales value is more than doubled when coupons are used by 19-24 year olds - the average basket sales value for baskets that utilized coupons is $58.50 compared to just $26.00 for baskets that did not use coupons. Considering only about one-fifth of 19-24 year olds are using coupons, it would be most beneficial to promote the use of coupons.

Summary

Our final findings suggest that, in order to increase revenue to frequent visitors but low spenders, Regork should run a coupon campaign targeted towards the 19-24 age group for bagged snacks. Our reasoning for targeting bagged snacks is two-fold: (1) bagged snacks can be bought in large quantities and (2) snacks and soft drinks are a commonly bought combination for 19-24 year-olds. Although this consumer group may have less disposable income than the others, we are confident that implementing a campaign like this will increase revenue as we are not targeting high ticket items. Instead, we are simply encouraging the shoppers to increase their basket quantity, thus increasing their basket sales value. This correlation was backed by our analysis, in which we uncovered a higher average basket sales value for those that utilized coupons. Additionally, we believe that our campaign will help retain current customers because of their savings, and it has the potential to attract new customers.

The exact coupon itself can be implemented by the operations and marketing team, but we suggest that the campaign be suggested towards buying bagged snacks in a heavy volume - perhaps a buy 4 get 1 type of event for soft drinks and bagged snacks. Encouraging consumers to buy this combination of products is not a crazy thought, as many of them already buy these together, but hopefully this drives them to purchase more of each item every time they visit Regork. This thought process will encourage the consumers to spend more money than they otherwise would have in the store, and in turn, boost current and future revenue at Regork.

It should be noted that there are limitations to this position. As noted in our analysis, there is a limited sample size for the 19-24 age range, with 19-24 year olds representing only 46 of the 808 households. Of course, a larger sample size would be beneficial for anyone looking to replicate our analysis.

Another limitation of our analysis is the breadth of which we analyzed the impact of coupons, campaigns, and promotions on overall spending behavior. While we believe we captured a glimpse into the effect, a more detailed look would certainly benefit the analysis in the long run, and uncover a more detailed/flushed out solution. Specifically, research done on the practicality of using coupons for bagged snacks particularly would be beneficial in predicting it viability and affect. Although we recognize this, such analysis did not seem feasible given our already limited sample size and history of transactions for the age group. However, we still remain confident that marketing our solution in the correct manner would result in our desired outcome.